Investigating the Mini-Challenge 2 of VAST Challenge 2021
Team Member:
Syed Ahmad Zaki, Singapore Management University of Singapore, ahmadzaki.2020@mitb.smu.edu.sg
Student Team: YES
Tools Used:
Rmarkdown
Approximately how many hours were spent working on this submission in total?
Provide an estimate of the total number of hours worked on this submission by your entire team.
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2021 is complete?
YES
Video
Provide a link to your video. Example: http://www.westbirmingham.ac.uk/uwb-smith-mc2-video.wmv
As a visual analytics expert assisting law enforcement, your mission is to identify which GASTech employees made which purchases and identify suspicious patterns of behavior. You must cope with uncertainties that result from missing, conflicting, and imperfect data to make recommendations for further investigation.
Use visual analytics to analyze the available data and develop responses to the questions below. In addition, prepare a video that shows how you used visual analytics to solve this challenge. Submission instructions are available here. Entry forms are available for download here.
A cursory look at the dataset reveals the following data types:
| Data | Type | Description |
|---|---|---|
| Credit Card.csv (cc) | Aspatial | Credit card txns by timestamp, location and amt |
| Loyalty.csv | Aspatial | Loyalty card txns by date, location and amt |
| Car Assignment.csv (car) | Aspatial | Car assignment ID with individuals’ name and role |
| MC2.jpg | Aspatial | Abila’s map in jpeg format |
| MC2.tif | Geospatial | Abila’s map in a geotiff format |
| GPS.csv (gps) | Geospatial | GPS points (latlong) by car ID and timestamp |
| Abila | Geospatial | Abila’s road network |
| Kronos Island | Geospatial | Polygon showing Kronos Island’s admin boundary |
An in-depth look at the dataset reveals the following fields:
| File Name | cc | loyalty | gps | car | mc2 |
|---|---|---|---|---|---|
| File Type | csv | csv | csv | csv | pic |
| Count | 1,490 | 1,392 | 685,169 | 44 | - |
| Date Format | m/d/y | m/d/y | m/d/y | - | - |
| Time Format | h:m | - | h:m:s | - | - |
| Location | Yes | Yes | - | - | Yes |
| Price | Yes | Yes | - | - | - |
| last4ccnum | Yes | - | - | - | - |
| loyaltynum | - | Yes | - | - | - |
| ID | - | - | Yes | Yes | - |
| Latlong | - | - | Yes | - | - |
| Names | - | - | - | Yes | - |
| Employment Details | - | - | - | Yes | - |
Not all files have the same fields. While it’s easy to merge gps and car data using its unique ID, there are no unique fields tying the cc and loyalty data together. Thus, merging both cc and loyalty data together would require some form of fuzzy joining logic. Separately, to add to the complexity, we would need to identify the various locations within the gps data, using both the cc and mc2 map.
With these dataset in mind, the following considerations would need to be addressed:
There are a few ways to employ fuzzy matching in our dataset. One is to use the native adist function within R, but its processing time leaves much to be desired. The other is to use packages specifically designed for fuzzy matching. One such package that is built for speed in matching similar phrases is stringdist. It uses openMP for parallel computing to speed up its matching of unequal content. The only downside (though it’s hardly a downside) is that it requires the columns of comparison to be housed in the same dataframe. Fuzzyjoin, built on top of stringdist, allows comparison of columns housed in different dataset, and its output include a merging of both datasets.
Unfortunately, deciding on the fuzzy logic package is the easy part. The harder part is to decide on the appropriate fuzzy join logic. Here’s a list of distance metrics currently supported by stringdist:
| Method Name | Description |
|---|---|
| osa | Optimal string aligment, (restricted Damerau-Levenshtein distance) |
| lv | Levenshtein distance (as in R’s native adist). |
| dl | Full Damerau-Levenshtein distance. |
| hamming | Hamming distance (a and b must have same nr of characters). |
| lcs | Longest common substring distance |
| qgram | q-gram distance |
| cosine | cosine distance between q-gram profiles |
| jaccard | Jaccard distance between q-gram profiles |
| jw | Jaro, or Jaro-Winkler distance |
Out of the above methods, osa, lv and dl seems most apt, since we’re dealing with phrases with differing lengths and are more concerned with slight edits, realignment, addition and subtraction of letters within these phrases. We’ll rely on the osa method since it’s a balance between finding the right edits and speed.
ABC
ABC Calendar Heatmap (GPS Points per day) http://visualdata.wustl.edu/varepository/VAST%20Challenge%202014/challenges/MC2%20-%20Patterns%20of%20Life%20Analysis/entries/University%20of%20Buenos%20Aires%20-%20Tralice/ Parallel Coordinates Graph (Location, Date, Timestamp, Price, Name) https://rpubs.com/tskam/PCP Gantt Chart Explore Hippokampos RadViz Clock Average Location https://www.cs.umd.edu/hcil/varepository/VAST%20Challenge%202014/challenges/MC2%20-%20Patterns%20of%20Life%20Analysis/entries/Central%20South%20University/
We start by loading all the necessary datasets provided in the VAST Challenge 2021 Mini-Challenge 2.
# Loading all datasets and image
cc <- readr::read_csv("data/cc_data.csv") # Add credit card data
loyalty <- readr::read_csv("data/loyalty_data.csv") # Add loyalty data
mc2 <- raster("data/MC2-tourist_modified.tif") # Add tif file as a raster layer
gps <- readr::read_csv("data/gps.csv") # Add gps data
car <- readr::read_csv("data/car-assignments.csv") # Add car assignments
Abila_st <- st_read(dsn = "data", layer = "Abila")
Kronos_sf <- st_as_sf(st_read(dsn = "data", layer = "Kronos_Island"))
As always, we review each dataset in greater detail. This is a necessary step in order to accurately prepare the data for subsequent use.
While reviewing the four csv data, we immediately noticed a few potential issues:
1. Date format within the timestamp were in a MM-DD-YYYY H:M format
2. Katerina’s Cafe contains unique characters, which may cause downstream problems during our analysis
3. ID and Last4CCNum are treated as regular double numbers, instead of a character type
4. Names and roles are broken up into multiple columns within the car data
Thus, we address these potential issues as well as create and simplify other columns for subsequent ease in analysis.
#--------------- Cleaning CC data ---------------
cc$timestamp <- date_time_parse(cc$timestamp,
zone = "UTC",
format = "%m/%d/%Y %H:%M") # Readjust CC timestamp
cc[grep("Katerina", cc$location),2] <- "Katerina's Cafe" # Replace unique characters in Katerina's Cafe
cc$last4ccnum <- as_factor(cc$last4ccnum) # Change the column format to nominal format
cc$hour <- as.numeric(format(cc$timestamp,"%H")) # Create a separate column just for hours in the cc data
cc$period <- case_when( # Segment hour into 5 separate periods
cc$hour >= 21 ~ "Late Evening 9pm to 11.59pm",
cc$hour >= 18 ~ "Evening 6pm to 8.59pm",
cc$hour >= 12 ~ "Afternoon 12noon to 5.59pm",
cc$hour >= 6 ~ "Morning 6am to 11.59am",
TRUE ~ "Late Night 12mn to 5.59am"
)
cc$period <- factor(cc$period, # Order periods accordingly
levels = c("Morning 6am to 11.59am",
"Afternoon 12noon to 5.59pm",
"Evening 6pm to 8.59pm",
"Late Evening 9pm to 11.59pm",
"Late Night 12mn to 5.59am"))
cc$dayofmonth <- day(cc$timestamp) # Extract day of month from timestamp in a new column
cc$dayofmonth <- as_factor(cc$dayofmonth) # Change to nominal format
cc$weekday <- wday(cc$timestamp, label = TRUE) # Extract day of week from timestamp in a new column
cc <- tibble::rowid_to_column(cc, "ID") # Create a numeric id column
cc$date <- as.Date(cc$timestamp) # Create a separate column just for dates in the cc data
cc$concat_cc_loyalty <- paste(cc$date,cc$location,cc$price) # Create a separate column of unique values using concatenated values in the cc data
cc$concat_cc_spots <- paste(cc$date,cc$location,cc$hour) # Create a second separate column of unique values using concatenated values in the cc data
cc$ID <- as_factor(cc$ID) # Change the column format to nominal format
#--------------- Cleaning Loyalty data ---------------
loyalty$timestamp <- date_time_parse(loyalty$timestamp,
zone = "UTC",
format = "%m/%d/%Y") # Readjust loyalty timestamp
loyalty[grep("Katerina", loyalty$location),2] <- "Katerina's Cafe" # Replace unique characters in Katerina's Cafe
loyalty$dayofmonth <- day(loyalty$timestamp) # Extract day of month from timestamp in a new column
loyalty$dayofmonth <- as_factor(loyalty$dayofmonth) # Change to nominal format
loyalty$weekday <- wday(loyalty$timestamp, label = TRUE) # Extract day of week from timestamp in a new column
loyalty$concat_loyalty_cc <- paste(loyalty$timestamp,loyalty$location,loyalty$price) # Create a separate column of unique values using concatenated values in the loyalty data
loyalty <- tibble::rowid_to_column(loyalty, "ID") # Create a numeric id column
loyalty$ID <- as_factor(loyalty$ID) # Change the column format to nominal format
#--------------- Cleaning Car Assignment data ---------------
car$CarID <- as_factor(car$CarID) # Change the column format to nominal format
car$FullName <- paste(car$FirstName,car$LastName, sep = " ") # Create new column with combined first and last name
car$RoleNName <- paste(car$CarID, car$CurrentEmploymentTitle, car$FullName, sep = " ") # Create new column with combined ID, Role and Full Name
#--------------- Cleaning GPS data ---------------
gps$id <- as_factor(gps$id) # Change the column format to nominal format
gps$Timestamp <- date_time_parse(gps$Timestamp,
zone = "UTC",
format = "%m/%d/%Y %H:%M:%S") # Readjust loyalty timestamp
gps$date <- as_date(gps$Timestamp) # Create a separate column just for dates in the gps data
gps$hour <- hour(gps$Timestamp) # Create a separate column just for hours in the gps data
gps$period <- case_when( # Segment hour into 5 separate periods
gps$hour >= 21 ~ "Late Evening 9pm to 11.59pm",
gps$hour >= 18 ~ "Evening 6pm to 8.59pm",
gps$hour >= 12 ~ "Afternoon 12noon to 5.59pm",
gps$hour >= 6 ~ "Morning 6am to 11.59am",
TRUE ~ "Late Night 12mn to 5.59am"
)
gps$period <- factor(gps$period, # Order periods accordingly
levels = c("Morning 6am to 11.59am",
"Afternoon 12noon to 5.59pm",
"Evening 6pm to 8.59pm",
"Late Evening 9pm to 11.59pm",
"Late Night 12mn to 5.59am"))
gps$dayofmonth <- day(gps$Timestamp) # Extract day of month from timestamp in a new column
gps$day <- as_factor(get_day(gps$Timestamp)) # Change the column format to nominal format
gps$weekday <- wday(gps$Timestamp, label = TRUE) # Extract day of week from timestamp in a new column
We will now attempt to find matching rows between the cc and loyalty data.
cc_loyalty <- cc %>% # Create a new df showing matches with a max distance difference of 1
stringdist_inner_join(loyalty,
by = c("concat_cc_loyalty" = "concat_loyalty_cc"),
method = "osa",
max_dist = 1,
distance_col = "distance")
cc_loyalty_1 <- cc_loyalty %>% # Isolate best matching cc and loyalty with more than 2 counts
group_by(last4ccnum,loyaltynum) %>%
count() %>%
filter(n>2) %>%
ungroup()
cc_loyalty_duplicate_cc <- cc_loyalty_1 %>% # Extract duplicates in cc data
filter(cc_loyalty_1$last4ccnum == cc_loyalty_1$last4ccnum[duplicated(cc_loyalty_1$last4ccnum)])
cc_loyalty_duplicate_loyalty <- subset(cc_loyalty_1,loyaltynum == "L6267" | loyaltynum == "L3288") # Extract duplicates in loyalty data
cc_loyalty_1$type <- "unique"
cc_loyalty_1[which(cc_loyalty_1$last4ccnum == cc_loyalty_duplicate_cc$last4ccnum),4] <- "duplicate"
cc_loyalty_1[which(cc_loyalty_1$loyaltynum == "L6267" |
cc_loyalty_1$loyaltynum == "L3288"),4] <- "duplicate"
First, we will merge the GPS data with the car assignments. Next, we will isolate GPS points, that have been stationary for at least 10 mins.
gps_name <- left_join(gps,car, by = c("id" = "CarID")) # Merge car assignments to gps data
gps_name$Timestamp <- as.POSIXct(gps_name$Timestamp, format = "%m/%d/%Y %H:%M:%S", tz = "GMT") # Timestamp switching to month-day-year format
gps_name <- gps_name[with(gps_name,order(id,Timestamp)),] # Sort first by ID in ascending order and then Timestamp by oldest to newest
gps_name <- gps_name %>% # Add running number in the first column
mutate(No = 1:n()) %>%
dplyr::select(No, everything())
gps_name <- gps_name %>% # Create additional column indicating time taken from previous timestamp for same ID
mutate(Delta = Timestamp - lag(Timestamp, default = first(Timestamp)))
gps_name$Delta <- as.numeric(gps_name$Delta) # Convert Delta column to numeric format
gps_name$Delta_Hours <- round(gps_name$Delta / 60 / 60, 1) # Create column to convert Delta seconds into hours with one decimal place
spots <- gps_name %>% # Filtering out stationary gps coordinates of more than 10 mins
filter(Delta > 600)
spots$No <- rep(1:2965, times = 1) # Redo running number in the first column
Next, using the map and other data sources, we identify the locations of each of these stationary GPS points. Through a visual inspection of the map, credit card and loyalty data, we found 66 unique locations.
spots$Location <- 1 # Create a Location column
spots <- spots %>% mutate( # Create additional column with location names based on latlong
Location = case_when(
between(lat, 36.05092013, 36.05102938) &
between(long, 24.82586806, 24.82598723) ~ "Abila Airport", # 35 features
between(lat, 36.07434876, 36.07443715) &
between(long, 24.84592966, 24.84598782) ~ "Abila Scrapyard", # 4 features
between(lat, 36.06342076, 36.06349309) &
between(long, 24.85096457, 24.85103679) ~ "Abila Zacharo", # 66 features
between(lat, 36.07712237, 36.07715385) &
between(long, 24.87617634, 24.87621582) ~ "Ahaggo Museum", # 5 features
between(lat, 36.07522801, 36.07530344) &
between(long, 24.85626503, 24.85634849) ~ "Albert's Fine Clothing", # 20 features
between(lat, 36.08172086, 36.08182543) &
between(long, 24.85086882, 24.85096705) ~ "Bean There Done That", # 46 features
between(lat, 36.05402149, 36.05413903) &
between(long, 24.90116515, 24.90128202) ~ "Brew've Been Served", # 106 features
between(lat, 36.07332048, 36.07336116) &
between(long, 24.86416419, 24.86420583) ~ "Brewed Awakenings", # 36 features
between(lat, 36.06582469, 36.065941) &
between(long, 24.90097567, 24.90108865) ~ "20 Building Control Stenig's Home", # 20 features
between(lat, 36.05851786, 36.05860144) &
between(long, 24.8808655, 24.88092654) ~ "Carlyle Chemical Inc.", # 30 features
between(lat, 36.07818062, 36.07821857) &
between(long, 24.87211555, 24.8721508) ~ "4 CFO Ingrid's Home", # 27 features
between(lat, 36.07682044, 36.07685752) &
between(long, 24.8658641, 24.86589901) ~ "10 CIO Ada's Home", # 35 features
between(lat, 36.0721156, 36.07215701) &
between(long, 24.87458425, 24.8746267) ~ "32 COO Orhan's Home", # 29 features
between(lat, 36.07062423, 36.07073983) &
between(long, 24.89517609, 24.89526281) ~ "Chostus Hotel", # 11 features
between(lat, 36.05462322, 36.05469486) &
between(long, 24.88977034, 24.88983886) ~ "Coffee Cameleon", # 29 features
between(lat, 36.08954231, 36.08962196) &
between(long, 24.86066508, 24.8607611) ~ "Desafio Golf Course", # 10 features
between(lat, 36.07292088, 36.07301365) &
between(long, 24.88396447, 24.88405897) ~ "26 Drill Site Manager Marin's Home", # 26 features
between(lat, 36.08442031, 36.08449538) &
between(long, 24.86416741, 24.8642387) ~ "7 Drill Technician Elsa's Home", # 25 features
between(lat, 36.08424703, 36.08432477) &
between(long, 24.8563809, 24.8564637) ~ "9 Drill Technician Gustav's Home", # 13 features
between(lat, 36.0726185, 36.07380904) &
between(long, 24.87510166, 24.87613744) ~ "28 Drill Technician Isande's Home", # 26 features
between(lat, 36.06922564, 36.06931513) &
between(long, 24.88416486, 24.88426267) ~ "27 Drill Technician Kare's Home", # 20 features
between(lat, 36.08542073, 36.08550845) &
between(long, 24.86036422, 24.86045943) ~ "2 Engineer Lars's Home", # 37 features
between(lat, 36.08664252, 36.08672442) &
between(long, 24.85756416, 24.85766744) ~ "3 Engineer Felix's Home", # 22 features
between(lat, 36.07622023, 36.07626546) &
between(long, 24.87466429, 24.87471053) ~ "35 Environmental Safety Advisor Willem's Home", # 33 features
between(lat, 36.07212045, 36.07213193) &
between(long, 24.84132949, 24.84134818) ~ "Frank's Fuel", # 2 features
between(lat, 36.05492145, 36.05503511) &
between(long, 24.90176782, 24.90188061) ~ "Frydos Autosupply n' More", # 29 features
between(lat, 36.04802098, 36.04805422) &
between(long, 24.87956497, 24.87957691) ~ "GasTech", # 738 features
between(lat, 36.05970763, 36.05981097) &
between(long, 24.85797552, 24.8580772) ~ "Gelatogalore", # 47 features
between(lat, 36.06034564, 36.06043016) &
between(long, 24.85646426, 24.85657454) ~ "General Grocer", # 12 features
between(lat, 36.05572125, 36.05584094) &
between(long, 24.90246542, 24.90258487) ~ "Guy's Gyros", # 143 features
between(lat, 36.06362146, 36.06371539) &
between(long, 24.88586605, 24.88595859) ~ "Hallowed Grounds", # 70 features
between(lat, 36.07660977, 36.07669909) &
between(long, 24.85756408, 24.85764247) ~ "Hippokampos", # 155 features
between(lat, 36.08412146, 36.08420924) &
between(long, 24.85896842, 24.85905081) ~ "11 Hydraulic Technician Axel's Home", # 23 features
between(lat, 36.08782802, 36.08793196) &
between(long, 24.85627136, 24.8563725) ~ "19 Hydraulic Technician Vira's Home", # 24 features
between(lat, 36.06641679, 36.06650723) &
between(long, 24.88256875, 24.88265687) ~ "1 IT Helpdesk Nils's Home", # 31 features
between(lat, 36.06729646, 36.06736745) &
between(long, 24.87788423, 24.87795559) ~ "5 IT Technician Isak's Home", # 21 features
between(lat, 36.06722012, 36.06731624) &
between(long, 24.8858687, 24.88596759) ~ "8 IT Technician Lucas's Home", # 23 features
between(lat, 36.06749651, 36.0675518) &
between(long, 24.87330651, 24.873366) ~ "Jack's Magical Beans", # 31 features
between(lat, 36.06582037, 36.06584879) &
between(long, 24.85236427, 24.85241027) ~ "Kalami Kafenion", # 47 features
between(lat, 36.05442247, 36.05453641) &
between(long, 24.89986596, 24.89998054) ~ "Katerina's Cafe", # 158 features
between(lat, 36.05292229, 36.05296701) &
between(long, 24.84936915, 24.84941679) ~ "Kronos Capital", # 6 features
between(lat, 36.06582196, 36.06587998) &
between(long, 24.8497762, 24.84983936) ~ "Kronos Mart", # 9 features
between(lat, 36.06523446, 36.06534083) &
between(long, 24.83307421, 24.83318494) ~ "Kronos Pipe and Irrigation", # 7 features
between(lat, 36.06402993, 36.06410072) &
between(long, 24.84137818, 24.84144338) ~ "Maximum Iron and Steel", # 9 features
between(lat, 36.05840347, 36.05849041) &
between(long, 24.88546548, 24.88553455) ~ "Nationwide Refinery", # 41 features
between(lat, 36.05859158, 36.05859887) &
between(long, 24.85790261, 24.85799357) ~ "Octavio's Office Supplies", # 3 features
between(lat, 36.05192066, 36.05197575) &
between(long, 24.87076418, 24.87082137) ~ "Ouzeri Elian", # 67 features
between(lat, 36.06764972, 36.06775002) &
between(long, 24.90243213, 24.9025445) ~ "34 Perimeter Control Edvard's Home", # 20 features
between(lat, 36.06324941, 36.06330782) &
between(long, 24.85226894, 24.8523291) ~ "Roberts and Sons", # 9 features
between(lat, 36.05942407, 36.05952152) &
between(long, 24.89476557, 24.8948649) ~ "Shared Home A - 6 Linnea 25 Kanon 29 Bertrand", # 72 features
between(lat, 36.06332304, 36.06343537) &
between(long, 24.89607033, 24.89617856) ~ "Shared Home B - 14 Lidelse 18 Birgitta 21 Hennie", # 60 features
between(lat, 36.06242283, 36.06253955) &
between(long, 24.89877023, 24.89888179) ~ "Shared Home C - 17 Sven 24 Minke 33 Brand", # 68 features
between(lat, 36.05842222, 36.05853828) &
between(long, 24.90096522, 24.90107874) ~ "Shared Home D - 22 Adra 23 Varja 30 Felix", # 73 features
between(lat, 36.0603222, 36.06044736) &
between(long, 24.90556693, 24.90569385) ~ "Shared Home E - 13 Inga 15 Loreto 16 Isia 21 Hennie", # 85 features
between(lat, 36.05282139, 36.05288367) &
between(long, 24.86856868, 24.8686314) ~ "Shoppers' Delight", # 17 features
between(lat, 36.06772112, 36.06784956) &
between(long, 24.89906521, 24.89917328) ~ "12 Site Control Hideki's Home", # 21 features
between(lat, 36.05409586, 36.05420832) &
between(long, 24.90806584, 24.90817838) ~ "Stewart and Sons Fabrication", # 36 features
between(lat, 36.06774029, 36.06776587) &
between(long, 24.87148791, 24.87150031) ~ "U-Pump", # 4 features
between(lat, 36.05012433, 36.05021624) &
between(long, 24.9003978, 24.90047475) ~ "Anonymous Site 1", # 6 features
between(lat, 36.06314781, 36.06324321) &
between(long, 24.90010823, 24.90018668) ~ "Anonymous Site 2", # 7 features
between(lat, 36.05893131, 36.05900826) &
between(long, 24.89277554, 24.89284962) ~ "Anonymous Site 3", # 7 features
between(lat, 36.08061881, 36.08067087) &
between(long, 24.84681621, 24.84688282) ~ "Anonymous Site 4", # 7 features
between(lat, 36.06944928, 36.0695319) &
between(long, 24.84147082, 24.84157048) ~ "Anonymous Site 5", # 8 features
between(lat, 36.05149231, 36.05253234) &
between(long, 24.87495168, 24.87611086) ~ "Anonymous Site 6", # 13 features
between(lat, 36.05543848, 36.05657576) &
between(long, 24.86618187, 24.86735) ~ "Anonymous Site 7", # 7 features
between(lat, 36.07099038, 36.07200089) &
between(long, 24.86869468, 24.86985682) ~ "Anonymous Site 8", # 10 features
))
spots$concat_spots_cc <- paste(spots$date,spots$Location,spots$hour) # Create a separate column of unique values using concatenated values in the distilled GPS data
spots_median <- spots %>% # Extract the median lat & long coordinates of locations
group_by(Location) %>%
summarise(lat.median = median(lat), long.median = median(long)) %>%
filter(!is.na(Location)) %>% # Exclude remaining few unmatched locations
ungroup()
spots_median <- spots_median %>% # Add additional column to classify locations into major buckets
mutate(Location.Type = case_when(
Location %in% c("Anonymous Site 1",
"Anonymous Site 2",
"Anonymous Site 3",
"Anonymous Site 4",
"Anonymous Site 5",
"Anonymous Site 6",
"Anonymous Site 7",
"Anonymous Site 8") ~ "Unknown",
Location %in% c("Bean There Done That",
"Brew've Been Served",
"Brewed Awakenings",
"Coffee Cameleon",
"Jack's Magical Beans",
"Hallowed Grounds") ~ "Coffee Cafe",
Location %in% c("Abila Zacharo",
"Gelatogalore",
"Guy's Gyros",
"Hippokampos",
"Kalami Kafenion",
"Katerina's Cafe",
"Ouzeri Elian") ~ "Food Joints",
Location %in% c("GasTech") ~ "HQ",
Location %in% c("1 IT Helpdesk Nils's Home",
"10 CIO Ada's Home",
"11 Hydraulic Technician Axel's Home",
"12 Site Control Hideki's Home",
"19 Hydraulic Technician Vira's Home",
"2 Engineer Lars's Home",
"20 Building Control Stenig's Home",
"26 Drill Site Manager Marin's Home",
"27 Drill Technician Kare's Home",
"28 Drill Technician Isande's Home",
"3 Engineer Felix's Home",
"32 COO Orhan's Home",
"34 Perimeter Control Edvard's Home",
"35 Environmental Safety Advisor Willem's Home",
"4 CFO Ingrid's Home",
"5 IT Technician Isak's Home",
"7 Drill Technician Elsa's Home",
"8 IT Technician Lucas's Home",
"9 Drill Technician Gustav's Home",
"Shared Home A - 6 Linnea 25 Kanon 29 Bertrand",
"Shared Home B - 14 Lidelse 18 Birgitta 21 Hennie",
"Shared Home C - 17 Sven 24 Minke 33 Brand",
"Shared Home D - 22 Adra 23 Varja 30 Felix",
"Shared Home E - 13 Inga 15 Loreto 16 Isia 21 Hennie") ~ "Residential",
Location %in% c("Abila Scrapyard",
"Carlyle Chemical Inc.",
"Kronos Pipe and Irrigation",
"Maximum Iron and Steel",
"Nationwide Refinery",
"Stewart and Sons Fabrication") ~ "Industrial",
Location %in% c("Ahaggo Museum",
"Albert's Fine Clothing",
"Kronos Mart",
"Octavio's Office Supplies",
"Shoppers' Delight",
"General Grocer",
"Roberts and Sons") ~ "Leisure & Shopping",
Location %in% c("Abila Airport",
"Chostus Hotel",
"Desafio Golf Course",
"Kronos Capital") ~ "Complex",
Location %in% c("Frank's Fuel",
"Frydos Autosupply n' More",
"U-Pump") ~ "Transport",
))
#sum(is.na(spots$Location))
#length(grep("Frydos Autosupply n' More", spots$Location))
#write_csv(spots,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\spots_latest.csv")
#write_csv(spots_median,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\spots_median.csv")
#write_csv(cc,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_date.csv")
Creating custom map of Abila with the use of tmap
Abila_st_union <- st_union(Abila_st) # Dissolve Abila road network
Abila_st_proj <- st_transform(Abila_st_union, crs = 3857)
Abila_st_buffer <- st_buffer(Abila_st_proj, dist = 25, nQuadSegs = 5, ) # Create a buffer around the dissolved Abila road network
gps_sf <- st_as_sf(gps, coords = c("long", "lat"), crs = 4326) # Changing into a shapefile
spots_median_sf <- st_as_sf(spots_median, coords = c("long.median", "lat.median"), crs = 4326) # Changing into a shapefile
gps_path <- gps_sf %>% # Creating a movement path
group_by(id) %>%
summarize(m = mean(Timestamp),
do_union=FALSE) %>%
left_join(dplyr::select(car,CarID,RoleNName), by = c("id" = "CarID")) %>% #Add in RoleNName column
ungroup() %>%
st_cast("LINESTRING")
# Create blue polygon as background to mimic sea
long.sea <- c(24.91075,24.91075,24.8232,24.8232,24.91075)
lat.sea <- c(36.09543,36.0445,36.0445,36.09543,36.09543)
sea <-data.frame(long.sea, lat.sea)
rm(long.sea)
rm(lat.sea)
sea_sf <- st_as_sf(sea, coords = c("long.sea", "lat.sea"))
st_crs(sea_sf) <- 4326
sea_poly<- st_sf(
st_cast(
st_combine(sea_sf$geometry),"POLYGON"
)
)
# Clip a smaller Kronos island around Abila
Kronos_sf_small <- st_crop(Kronos_sf, c(xmin = 24.8232, xmax = 24.91075, ymin = 36.0445, ymax = 36.09543))
tmap_mode("view")
#tm_shape(mc2) +
# tm_rgb(mc2, r = 1,g = 2,b = 3,
# alpha = NA,
# saturation = 1,
# interpolate = TRUE,
# max.value = 255) +
custom_tmap <- tm_shape(sea_poly) +
tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==2)) +
tm_lines(id = "RoleNName") +
tm_shape(spots_median_sf) +
tm_dots(col = "Location.Type",
id = "Location", # Bold in group
popup.vars = "Location Type:" =="Location.Type",
size = 0.2)
custom_tmap
#class(sea_poly)
#st_cast(sea_sf,"POLYGON")
#class(sea_sf)
#sea_sf <- st_multilinestring(sea_sf)
#st_cast(sea_sf, "POLYGON")
#sea <- matrix(c(24.90976,36.09543,
# 24.90976,36.04499,
# 24.82419,36.04499,
# 24.82419,36.09543,
# 24.90976,36.09543), ncol=2, byrow = TRUE)
#sea_poly <- st_polygon(list(sea))
#sea_poly <- st_set_crs(sea_poly, 4326)
#st_crs(sea_poly)
#class(sea_poly)
#gps_path_selected <- gps_path %>%
# filter(id==3)
#gps_path_selected <- gps_path %>%
# filter(id==1)
#st_crs(Abila_st_union)
#st_crs(Abila_st_buffer)
#print(Abila_st_buffer)
#class(Abila_st)
#Abila_st_proj <- st_transform(Abila_st, 32632) # UTM Zone 32N
#Abila_st_buffer <- st_buffer(Abila_st_proj, dist = 100)#write_sf(Abila_st_union,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\Abila_st_union.shp")
#write_sf(Abila_st_buffer,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\Abila_st_buffer.shp")
#write_sf(sea_poly,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\sea_poly.shp")
Here we will answer the VAST Challenge questions.
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies? Please limit your answer to 8 images and 300 words.
Food and beverage places, such as Brew’ve Been Served, Guy’s Gyros, Hallowed Grounds etc. seem to be the more popular locations, as highlighted in the dark grey tiles below.
cc_calendar <- cc %>%
count(dayofmonth, location)
cc_calendar$dayofmonth <- as_factor(cc_calendar$dayofmonth)
Q5.1.1 <- ggplot(complete(cc_calendar, dayofmonth, location), aes(x = dayofmonth, y = location)) +
geom_tile(aes(fill = n), color = "white", size = 0.1) +
scale_fill_gradient(low = "light grey", high = "black", na.value = "light grey") +
scale_y_discrete(expand = expansion(add = 1.6),
limits=rev) +
labs(title = "Calendar Heatmap of Location Visit Frequency (From CC Data) By Date",
subtitle = "Food and coffee outlets seem to be the most frequented, based on credit card data",
x = "Day of Month",
fill = "Frequency Of Visit") +
theme_bw() +
theme(axis.ticks = element_blank(),
panel.border = element_blank(),
panel.spacing = unit(0.1, "cm"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
text = element_text(size=7),
axis.title.x = element_text(vjust=-3),
axis.title.y = element_blank(),
legend.position = "bottom")
Q5.1.1

The txn dates of Kronos Mart differ by exactly one day, comparing either side of the credit card and loyalty data. Investigation needed to ascertain true transaction dates of Kronos Mart’s books, perhaps through receipt verification. Looking at the GPS data, single visits on 9th, 11th, 12th, 13th, 15th, 16th and three visits on 18th suggests that these GPS visits matches more to loyalty data than the credit card data.
Q5.1.2_cc <- cc %>%
filter(location == "Kronos Mart") %>%
dplyr::select(dayofmonth, price, location) %>%
group_by(dayofmonth) %>%
summarise(cc_data = sum(price)) %>%
ungroup()
Q5.1.2_loyalty <- loyalty %>%
filter(location == "Kronos Mart") %>%
dplyr::select(dayofmonth, price, location) %>%
group_by(dayofmonth) %>%
summarise(loyalty_data = sum(price)) %>%
ungroup()
Q5.1.2_combined <- data.frame(dayofmonth = c(6:19))
Q5.1.2_combined$dayofmonth <- as_factor(Q5.1.2_combined$dayofmonth)
Q5.1.2_combined <- Q5.1.2_combined %>%
left_join(Q5.1.2_cc, by = "dayofmonth") %>%
left_join(Q5.1.2_loyalty, by = "dayofmonth")
Q5.1.2_combined$cc_data[is.na(Q5.1.2_combined$cc_data)] <- 0
Q5.1.2_combined$loyalty_data[is.na(Q5.1.2_combined$loyalty_data)] <- 0
Q5.1.2_combined <-melt(Q5.1.2_combined, id.vars = "dayofmonth", variable.name = "source")
Q5.1.2 <- ggplot(Q5.1.2_combined, aes(dayofmonth, value, group = source)) +
geom_area(aes(colour = source, fill = source),
size = 1) +
geom_point() +
geom_text(data=subset(Q5.1.2_combined, value != 0),
aes(label = value,
group = source),
vjust = -1,
size = 3) +
facet_grid(source~.) +
ylim(0,500) +
labs("title" = "Kronos Mart's Suspicious Delayed Transactions",
"subtitle" = "Loyalty transactions in Kronos Mart recorded one day earlier than in credit card") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.position = "none")
Q5.1.2

Q5.1.2_gps <- tibble("dayofmonth" = c(6:19)) %>%
left_join(spots %>%
group_by(Location, dayofmonth) %>%
tally() %>%
filter(Location == "Kronos Mart") %>%
ungroup(), by = "dayofmonth") %>%
mutate(n2=n) %>%
replace_na(list(n=0))
Q5.1.2a <- ggplot(Q5.1.2_gps,
aes(x = dayofmonth, y = n)) +
geom_area(size = 1) +
geom_point() +
geom_text(aes(label = n2), na.rm = TRUE,
vjust = -1,
size = 3) +
scale_x_continuous(breaks = seq(6,19,1)) +
ylim(0,5) +
labs("title" = "GPS Movements to Kronos Mart Validates Loyalty Data",
"subtitle" = "GPS data seem to validate the loyalty data, more than the credit card data") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.position = "none")
Q5.1.2a

Ascertaining the location of unknown locations such as Hippokampos and Abila Zacharo seem tricky, given that their location names do not describe its very nature. Thus, we’re forced to rely on their time-based transactions to approximate the nature of their locations.
Q5.1.3 <- ggplot(cc,
aes(x = hour,
y = location,
fill = stat(x)
)) +
geom_density_ridges_gradient(scale=3,rel_min_height = 0.001) +
scale_x_continuous(breaks = 0:24) +
scale_y_discrete(limits=rev) +
scale_fill_viridis_c(name = "ABC", option = "A") +
theme_ridges(font_size = 7, grid = TRUE) +
theme(legend.position = "none") +
labs(title = "Uncovering Location Type Beyond Ambiguous Location Names",
subtitle = "High Noon Txns Suggests Abila Zacharo and Hippokampos As Food Outlets")
Q5.1.3

Coffee chains usually open for longer than just the three hours we see in the data, given the traditionally low beverage costing.
Q5.1.4_cc <- cc %>%
left_join(dplyr::select(spots_median,Location, Location.Type), by = c("location" = "Location")) %>%
filter(Location.Type == "Coffee Cafe") %>%
dplyr::select(location, hour, price) %>%
group_by(location, hour) %>%
summarise(coffee_money = sum(price), .groups = "drop") %>%
ungroup() %>%
dcast(hour ~ location, value.var = "coffee_money")
Q5.1.4_cc$hour <- as_factor(Q5.1.4_cc$hour)
Q5.1.4_combined <- data.frame(hour = c(1:24))
Q5.1.4_combined$hour <- as_factor(Q5.1.4_combined$hour)
Q5.1.4_combined <- Q5.1.4_combined %>%
left_join(Q5.1.4_cc, by = "hour")
Q5.1.4_combined <-melt(Q5.1.4_combined, id.vars = "hour", variable.name = "coffee_place")
Q5.1.4 <- ggplot(Q5.1.4_combined, aes(hour, value, fill = coffee_place)) +
geom_bar(stat = "identity") +
coord_polar(theta = "x") +
labs(title = "Daily CC Txns At Coffee Chains Restricted To Only Three Hours",
subtitle = "Three Coffee Chains Have CC Txns Only At Noon") +
xlab("") +
ylab("") +
theme(
axis.ticks = element_blank(),
axis.text.y = element_blank(),
panel.background = element_blank(),
panel.grid.major.x = element_line(colour="grey"),
axis.text.x = element_text(size = 15),
legend.title=element_blank())
Q5.1.4

Our original dataset contained 55 credit card numbers and 54 loyalty card numbers respectively. As part of our fuzzy matching, we were able to complete a 1-to-1 match of 49 pairs of credit and loyalty cards. The remaining cards were found to have a 1-to-2 matching relationship. More investigation would need to be done on these 1-to-2 matches.
Q5.1.5_label_cc <- data.frame("id" = 1:54,
"code" = as_factor(cc_loyalty_1$last4ccnum))
Q5.1.5_label_loyalty <- data.frame("id" = 55:108,
"code" = cc_loyalty_1$loyaltynum)
Q5.1.5_label <- bind_rows(Q5.1.5_label_cc,
Q5.1.5_label_loyalty)
Q5.1.5_label <- subset(Q5.1.5_label, select = -1 )
Q5.1.5 <- ggparcoord(cc_loyalty_1,
columns = 1:2,
groupColumn = 4,
showPoints = TRUE,
alphaLines = 1) +
geom_text(aes(label= Q5.1.5_label$code),
size = 3,
nudge_x = 0.06) +
scale_color_manual(values=c( "#172623", "#E8E8E8")) +
theme_minimal() +
scale_y_discrete(breaks = NULL) +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "bottom") +
labs(title = "Credit Card and Loyalty Number Mostly Matched One-To-One",
subtitle = "Two Loyalty Numbers Are Each Attached To Two Different Credit Cards; \nOne Credit Card Linked To Two Different Loyalty Numbers")
Q5.1.5

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.
Q5.2.1a_weekday <- data.frame("weekday" = unique(cc[c('weekday')])) %>%
slice(rep(1:n(), each = 5))
Q5.2.1a_period <- data.frame("period" = unique(cc[c('period')]))
Q5.2.1a_period$period <- factor(Q5.2.1a_period$period,
levels = c("Morning 6am to 11.59am",
"Afternoon 12noon to 5.59pm",
"Evening 6pm to 8.59pm",
"Late Evening 9pm to 11.59pm",
"Late Night 12mn to 5.59am"))
Q5.2.1a_period <- as.data.frame(lapply(Q5.2.1a_period,rep,7))
Q5.2.1a_combined <- cbind(Q5.2.1a_weekday,Q5.2.1a_period)
Q5.2.1a_cc <- cc %>%
group_by(weekday,period) %>%
tally() %>%
ungroup()
Q5.2.1a_combined <- Q5.2.1a_combined %>%
left_join(Q5.2.1a_cc, by = c("weekday"="weekday","period"="period"))
Q5.2.1a_combined$id <- seq(1, nrow(Q5.2.1a_combined))
Q5.2.1a_combined[36:63,] <- NA
Q5.2.1a_combined[36,4] <- 5.1
Q5.2.1a_combined[37,4] <- 5.2
Q5.2.1a_combined[38,4] <- 5.3
Q5.2.1a_combined[39,4] <- 5.4
Q5.2.1a_combined[40,4] <- 10.1
Q5.2.1a_combined[41,4] <- 10.2
Q5.2.1a_combined[42,4] <- 10.3
Q5.2.1a_combined[43,4] <- 10.4
Q5.2.1a_combined[44,4] <- 15.1
Q5.2.1a_combined[45,4] <- 15.2
Q5.2.1a_combined[46,4] <- 15.3
Q5.2.1a_combined[47,4] <- 15.4
Q5.2.1a_combined[48,4] <- 20.1
Q5.2.1a_combined[49,4] <- 20.2
Q5.2.1a_combined[50,4] <- 20.3
Q5.2.1a_combined[51,4] <- 20.4
Q5.2.1a_combined[52,4] <- 25.1
Q5.2.1a_combined[53,4] <- 25.2
Q5.2.1a_combined[54,4] <- 25.3
Q5.2.1a_combined[55,4] <- 25.4
Q5.2.1a_combined[56,4] <- 30.1
Q5.2.1a_combined[57,4] <- 30.2
Q5.2.1a_combined[58,4] <- 30.3
Q5.2.1a_combined[59,4] <- 30.4
Q5.2.1a_combined[60,4] <- 35.1
Q5.2.1a_combined[61,4] <- 35.2
Q5.2.1a_combined[62,4] <- 35.3
Q5.2.1a_combined[63,4] <- 35.4
Q5.2.1a_combined <- Q5.2.1a_combined %>%
arrange(id)
Q5.2.1a_combined$id <- seq(1, nrow(Q5.2.1a_combined))
Q5.2.1a_combined$period <- factor(Q5.2.1a_combined$period,
levels = c("Morning 6am to 11.59am",
"Afternoon 12noon to 5.59pm",
"Evening 6pm to 8.59pm",
"Late Evening 9pm to 11.59pm",
"Late Night 12mn to 5.59am"))
Q5.2.1a_label <- Q5.2.1a_combined
Q5.2.1a_number_of_bar <- nrow(Q5.2.1a_label)
Q5.2.1a_angle <- 90 - 360 * (Q5.2.1a_label$id-0.5) /Q5.2.1a_number_of_bar
Q5.2.1a_label$hjust <- ifelse(Q5.2.1a_angle < -90, 1, 0)
Q5.2.1a_label$angle <- ifelse(Q5.2.1a_angle < -90, Q5.2.1a_angle+180, Q5.2.1a_angle)
Q5.2.1a_base <- Q5.2.1a_combined %>%
group_by(weekday) %>%
summarize(start=min(id), end=max(id) - 4) %>%
rowwise() %>%
mutate(title=mean(c(start, end))) %>%
ungroup()
Q5.2.1a_grid <- Q5.2.1a_base
Q5.2.1a_grid$end <- Q5.2.1a_grid$end[ c( nrow(Q5.2.1a_grid), 1:nrow(Q5.2.1a_grid)-1)] + 1
Q5.2.1a_grid$start <- Q5.2.1a_grid$start - 1
Q5.2.1a_grid <- Q5.2.1a_grid[-1,]
Q5.2.1a <- ggplot(Q5.2.1a_combined, aes(x=as_factor(id), y=n, fill=period)) +
geom_bar(aes(x=as_factor(id), y=n, fill=period), stat="identity", alpha=0.5) +
geom_segment(data=Q5.2.1a_grid, aes(x = end, y = 120, xend = start, yend = 120), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=Q5.2.1a_grid, aes(x = end, y = 90, xend = start, yend = 90), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=Q5.2.1a_grid, aes(x = end, y = 60, xend = start, yend = 60), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=Q5.2.1a_grid, aes(x = end, y = 30, xend = start, yend = 30), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
annotate("text", x = rep(max(Q5.2.1a_combined$id),4), y = c(30, 60, 90, 120), label = c("30", "60", "90", "120") , color="grey", size=3 , angle=0, fontface="bold", hjust=1) +
geom_bar(aes(x=as_factor(id), y=n, fill=period), stat="identity", alpha=0.5) +
ylim(-100,150) +
theme_minimal() +
theme(axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
plot.margin = unit(rep(-1,4), "cm")) +
coord_polar() +
geom_text(data=Q5.2.1a_label, aes(x=id, y=n+10, label=n, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=3, angle= Q5.2.1a_label$angle, inherit.aes = FALSE ) +
geom_segment(data=Q5.2.1a_base, aes(x = start, y = -5, xend = end, yend = -5), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE ) +
geom_text(data=Q5.2.1a_base, aes(x=title, y=-18, label=weekday), color="black", fontface="bold",alpha=0.6, size=3, inherit.aes = FALSE )
Q5.2.1a

tmap_mode("view")
Q5.2.3 <- tm_shape(sea_poly) +
tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==28)) + # Extract Isande's path
tm_lines(col = "black",
lty = 1,
id = "RoleNName") +
tm_shape(spots_median_sf) +
tm_dots(col = "Location.Type",
id = "Location", # Bold in group
popup.vars = "Location Type:" =="Location.Type",
size = 0.2)
Q5.2.3
Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.
We will find matching IDs between the credit card and spots data. Spots data are basically GPS points that have remained stationary for more than 10 mins. The main columns of comparison are the day of month, location and hour. Identical rows are determined based on a max of 1 character edit. From here, we identify the match, by counting the max number of matchings rows between credit card and spots data.
cc_spots <- cc %>% # Create a new df that shows matches with a max distance difference of 1
stringdist_inner_join(spots,
by = c("concat_cc_spots" = "concat_spots_cc"),
method = "osa",
max_dist = 1,
distance_col = "distance")
cc_spots_1 <- cc_spots %>% # Isolate best matching cc and spots with more than 2 counts
filter(!is.na(FullName)) %>% # Remove unknown drivers
group_by(RoleNName,last4ccnum) %>%
count() %>%
arrange(RoleNName,-n) %>% # Arrange the highest to lowest count in each group
ungroup()
cc_spots_1[1,3] <- "matches"
cc_summary <- cc %>%
group_by(last4ccnum) %>%
count() %>%
ungroup()
cc_spots_1 <- cc_spots_1[!duplicated(cc_spots_1$RoleNName),] # Isolating 1 cc to 1 driver
cc_spots_1$last4ccnum[duplicated(cc_spots_1$last4ccnum)] # Need to remove five duplicates
#write_csv(cc_spots_1,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_spots_1.csv")
#write_csv(cc_loyalty_1,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_loyalty_1.csv")
#cc_loyalty_duplicate_cc <- cc_loyalty_1 %>% # Extract duplicates in cc data
# filter(cc_loyalty_1$last4ccnum == cc_loyalty_1$last4ccnum[duplicated(cc_loyalty_1$last4ccnum)])
#cc_loyalty_duplicate_loyalty <- subset(cc_loyalty_1,loyaltynum == "L6267" | loyaltynum == "L3288") # Extract duplicates in loyalty data
Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.
tmap_mode("view")
Q5.4.1 <- tm_shape(sea_poly) +
tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==7)) + # Extract Elsa's path
tm_lines(col = "black",
lty = 1,
id = "RoleNName") +
tm_shape(gps_path %>% filter(id==33)) + # Extract Brand's path
tm_lines(col = "blue",
lty = 1,
id = "RoleNName") +
tm_shape(spots_median_sf) +
tm_dots(col = "Location.Type",
id = "Location", # Bold in group
popup.vars = "Location Type:" =="Location.Type",
size = 0.2)
Q5.4.1
tmap_mode("view")
Q5.4.2 <- tm_shape(sea_poly) +
tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==21)) + # Extract Hennie's path
tm_lines(col = "black",
lty = 1,
id = "RoleNName") +
tm_shape(spots_median_sf %>%
# filter(grepl("Hennie", "Location")) +
filter(Location == "Shared Home B - 14 Lidelse 18 Birgitta 21 Hennie" | Location == "Shared Home E - 13 Inga 15 Loreto 16 Isia 21 Hennie")) +
# filter(Location.Type == "Residential")) +
tm_dots(col = "green",
size = 0.2)
Q5.4.2
Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why. Please limit your response to 10 images and 500 words.
Suspicious activities can be in the following form: 1) Explore presence of two or more individuals at the same location at the same hour for extended periods 2) Individuals Frequenting Unusual Places At Abnormal Hours
tmap_mode("view")
Q5.5.1 <- tm_shape(mc2) +
tm_rgb(mc2, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(spots_median_sf %>%
filter(Location.Type != "Unknown")) +
tm_dots(col = "Location.Type",
id = "Location", # Bold in group
popup.vars = "Location Type:" =="Location.Type",
size = 0.2) +
tm_shape(spots_median_sf %>%
filter(Location.Type == "Unknown")) +
tm_dots(col = "black",
id = "Location", # Bold in group
popup.vars = "Location Type:" =="Location.Type",
size = 0.2)
Q5.5.1
Showcasing only residential points, Bodrogi (ID: 15, black line), Vann (ID: 16, blue line), Osvaldo (ID:21, purple line) and Mies (ID:24, red line) were seen patroling key executives’ houses located near the centre area. (Hover over the lines and points to see the ID and owner of each residence)
tmap_mode("view")
Q5.5.2 <- tm_shape(sea_poly) +
tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==15)) + # Extract Bodrogi's path
tm_lines(col = "black",
lty = 1,
id = "RoleNName") +
tm_shape(gps_path %>% filter(id==16)) + # Extract Vann's path
tm_lines(col = "blue",
lty = 1,
id = "RoleNName") +
tm_shape(gps_path %>% filter(id==21)) + # Extract Osvaldo's path
tm_lines(col = "purple",
lty = 1,
id = "RoleNName") +
tm_shape(gps_path %>% filter(id==24)) + # Extract Mies's path
tm_lines(col = "red",
lty = 1,
id = "RoleNName") +
tm_shape(spots_median_sf %>%
filter(Location.Type == "Residential")) +
tm_dots(col = "green",
size = 0.2)
Q5.5.2
If you solved this mini-challenge in 2014, how did you approach it differently this year?
We did not attempt this mini-challenge in 2014.
#—————————————————–
tmap_mode("view")
Q5.5.2 <- tm_shape(sea_poly) +
tm_polygons(col="lightblue") +
tm_shape(Kronos_sf_small) +
tm_polygons(col = "beige") +
tm_shape(Abila_st_buffer) +
tm_polygons(col = "white") +
tm_shape(gps_path %>% filter(id==15)) + # Extract Bodrogi's path
tm_lines(col = "black",
lty = 1,
id = ) +
tm_shape(gps_path %>% filter(id==16)) + # Extract Vann's path
tm_lines(col = "blue",
lty = 1) +
tm_shape(gps_path %>% filter(id==21)) + # Extract Osvaldo's path
tm_lines(col = "pink",
lty = 1) +
tm_shape(gps_path %>% filter(id==24)) + # Extract Mies's path
tm_lines(col = "red",
lty = 1) +
tm_shape(spots_median_sf %>%
filter(Location.Type == "Residential")) +
tm_dots(col = "green",
size = 0.2)
Q5.5.2
# tm_dots(col = "Location.Type",
# id = "Location", # Bold in group
# popup.vars = "Location Type:" =="Location.Type",
# size = 0.2)
Testing the use of a static image as background and showcasing the gps data on it using ggplot
# Mapping map and gps together specifically for CarID #1
#mapping <- ggplot(gps_name %>%
# filter(id == "1"),
# aes(long, lat)) +
# annotation_custom(rasterGrob(mc2,
# width = unit(1, "npc"),
# height = unit(1,"npc")),
# xmin = 24.8244, xmax = 24.9096, ymin = 36.0453, ymax = 36.0952) + #Original searching
# xmin = 24.82419, xmax = 24.90976, ymin = 36.04499, ymax = 36.09543) + #Prof's raster extraction
# geom_point(size = 0.1) +
# coord_fixed(xlim = c(24.8244, 24.9096), ylim = c(36.0453, 36.0952)) + # Fixing the scales regardless of filtering of points
# theme_bw() + theme(panel.border = element_blank(), # Remove background and grids and reformat scales and axis
# panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# axis.line = element_line(colour = "black"))
# + transition_time(Timestamp) +
# labs(title = "Date:{frame_time}")
#mapping
Past work:
We will use fuzzy string matching using Levenshtein distance which is available natively in R’s adist utilities package.
dist.concat <- adist(cc$concat,loyalty$concat, partial = TRUE, ignore.case = TRUE) #Creates a matrix with the Standard Levenshtein distance between both newly created concat columns
min.concat <- apply(dist.concat, 1, min) #Extract pairs with minimum distance
match.s1.s2 <- NULL
for (i in 1:nrow(dist.concat))
{
s2.i <- match(min.concat[i], dist.concat[i,])
s1.i <- i
match.s1.s2 <- rbind(data.frame(loyalty.i=s2.i,
cc.i=s1.i,
loyalty_concat=loyalty[s2.i,]$concat,
cc_concat=cc[s1.i,]$concat,
adist=min.concat[i]),match.s1.s2)
}
cc_loyalty <- match.s1.s2 %>%
left_join(dplyr::select(cc, last4ccnum, ID), by = c("cc.i" = "ID")) %>% #Add in CC num column
left_join(dplyr::select(loyalty, loyaltynum, ID), by = c("loyalty.i" = "ID")) #Add in loyalty card num column
Let’s now extract the matching credit card-loyalty pairs according to 80% matching of their comparative distance.
cc_loyalty_unique <- dcast(cc_loyalty, last4ccnum + loyaltynum ~ adist) #Long to wide by transposing adist
cc_loyalty_unique$Total <- rowSums(cc_loyalty_unique[,c("0","1","2","3","4","5","11")]) #Sum all rows
cc_loyalty_unique$Sum01 <- rowSums(cc_loyalty_unique[,c("0","1")]) #Sum only column 1 and 2
cc_loyalty_unique$MatchPctTotal <- percent(cc_loyalty_unique[,3] / cc_loyalty_unique$Total) #Calc % of perfect (0) matches against Total
cc_loyalty_unique$MatchPct01 <- percent(cc_loyalty_unique$Sum01 / cc_loyalty_unique$Total) #Calc % of perfect (0) and almost perfect (0) matches against Total
cc_loyalty_unique_80 <- cc_loyalty_unique %>%
filter (MatchPct01 >= "80.00%")
n_distinct(cc_loyalty_unique_80$last4ccnum)
n_distinct(cc_loyalty_unique_80$loyaltynum)
write_csv(cc_loyalty_unique_80,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_loyalty_unique_80.csv")
dist.concat2 <- adist(cc$concat2,spots$concat, partial = TRUE, ignore.case = TRUE) #Create a matrix with the Standard Levenshtein distance between both newly created concat columns
min.concat2 <- apply(dist.concat2, 1, min) #Extract pairs with minimum distance
match.s3.s4 <- NULL
for (j in 1:nrow(dist.concat2))
{
s4.j <- match(min.concat2[j], dist.concat2[j,])
s3.j <- j
match.s3.s4 <- rbind(data.frame(spots.j=s4.j,
cc.j=s3.j,
spots_concat=spots[s4.j,]$concat,
cc_concat2=cc[s3.j,]$concat2,
adist=min.concat2[j]),match.s3.s4)
}
cc_spots <- match.s3.s4 %>%
left_join(dplyr::select(cc, last4ccnum, ID), by = c("cc.j" = "ID")) %>% #Add in CC num column
left_join(dplyr::select(spots, RoleNName, No), by = c("spots.j" = "No")) #Add in Spots num column
cc_spots_unique <- dcast(cc_spots, last4ccnum + RoleNName ~ adist) #Long to wide by transposing adist
cc_spots_unique$Total <- rowSums(cc_spots_unique[,c("0","1","2","3","7","8","9","10","11","12","13","14","15","16","17","18","19","21")]) #Sum all rows
cc_spots_unique$Sum01 <- rowSums(cc_spots_unique[,c("0","1")]) #Sum only column 1 and 2
cc_spots_unique$MatchPctTotal <- percent(cc_spots_unique[,3] / cc_spots_unique$Total) #Calc % of perfect (0) matches against Total
cc_spots_unique$MatchPct01 <- percent(cc_spots_unique$Sum01 / cc_spots_unique$Total) #Calc % of perfect (0) and almost perfect (0) matches against Total
cc_spots_unique_80 <- cc_spots_unique %>%
filter (MatchPct01 >= "80.00%")
n_distinct(cc_spots_unique_80$last4ccnum)
n_distinct(cc_spots_unique_80$RoleNName)
write_csv(cc_loyalty_unique_80,"C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\cc_loyalty_unique_80.csv")
Using fuzzyjoin (use Dcast median values)
spots_cc <- stringdist_join(cc,spots,
by = "concat_cc_spots",
mode = "left",
method = "jw",
ignore_case = TRUE,
max_dist = 0.2,
distance_col = "dist")
spots_cc_summary <- spots_cc %>%
group_by(concat_cc_spots.x) %>%
slice_min (order_by = dist, n = 1) %>%
arrange (desc(dist)) %>%
filter(dist < 0.1) %>%
ungroup()
write_csv(spots_cc, "C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\spots_cc.csv")
write_csv(spots_cc, "C:\\Users\\syeda\\OneDrive\\Documents\\SMU Courses\\2021T3 ISSS608 Visual Analytics and Applications\\3. Submissions\\Assignments\\Sandbox\\spots_cc_summary.csv")
With this, we then used the spots data above to find the coordinates nearest to these locations. We recorded the coordinates of each location in an Excel file, and would now bring it into R.
spot_list <- read_excel("data/spots and list.xlsx", sheet = "LIST") #Import Excel file
Change to GPS lines
#gps <- readr::read_csv("data/gps.csv") # Add gps data
#gps$Timestamp <- date_time_parse(gps$Timestamp,
# zone = "",
# format = "%m/%d/%Y %H:%M:%S")
#spots_sf <- st_as_sf(spots, coords = c("long", "lat"), # Changing into a shapefile
# crs = 4326, agr = "constant")
#spots_l <- spots[,4:5] # Extract only lat long
#spots_ll <- spots_l %>% slice(rep(1:n(), each = 3648))
#spots_ll$No <- rep(1:3648, times = 3648)
#spots_l$Noo <- rep(1:3648, times = 1)
#spots_ll <- left_join(spots_ll,spots_l, by = c("No" = "Noo"))
#spots_ll$Diff <- sqrt((spots_ll$lat.x-spots_ll$lat.y)^2+(spots_ll$long.x-spots_ll$long.y)^2)
#n_perc(spots_ll$Diff < 0.05)
#write_csv(spots,"C:\\Users\\syeda\\OneDrive\\Desktop\\spots.csv")
# Mapping map and gps together specifically for CarID #1
#mapping_spots <- ggplot(spots, aes(long, lat)) +
# annotation_custom(rasterGrob(mc2,
# width = unit(1, "npc"),
# height = unit(1,"npc")),
# xmin = 24.8244, xmax = 24.9096, ymin = 36.0453, ymax = 36.0952) +
# geom_point(size = 0.1) +
# coord_fixed(xlim = c(24.8244, 24.9096), ylim = c(36.0453, 36.0952)) + # Fixing the scales regardless of filtering of points
# theme_bw() + theme(panel.border = element_blank(), # Remove background and grids and reformat scales and axis
# panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# axis.line = element_line(colour = "black"))
#mapping_spots
cc_calendar_one <- cc %>%
filter(location == "Katerina's Cafe") %>%
count(date,hour)
cc_calendar_ggplot_one <- ggplot(complete(cc_calendar_one, date, hour),
aes(x = date, y = hour)) +
geom_tile(aes(fill = n), color = "white", size = 0.1) +
scale_fill_gradient(low = "light grey", high = "black", na.value = "light grey") +
scale_x_date(date_labels = "%a \n %d %b",
date_breaks = "1 day") +
scale_y_reverse() +
labs(title = paste(loc1,"Visit Frequency By Date And Hour"),
fill = "Frequency \n Of Visit") +
theme_bw() +
theme(plot.title = element_text(hjust=0.5),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank())
cc_calendar_ggplot_one